
Creates Structure for Upload Requests #61

Merged: 3 commits merged into main on Nov 17, 2023
Conversation

@gbdubs (Contributor) commented Nov 16, 2023

The request is two-step: (a) create the upload URLs and blobs; then, after the assets are uploaded, (b) explicitly kick off a parsing process.

Sending the PR for the interface first - trying to get smaller ones going.

@gbdubs gbdubs requested a review from bcspragu November 16, 2023 17:04
@bcspragu (Collaborator) left a comment


LGTM, though I'm not sure I understand the API:

  1. Users request to upload N items
  2. Users get N upload URLs
  3. Users upload N files
  4. Users say "Hey, I uploaded" - One weakness here is that clients don't need to report the same number N of files here
  5. Users get back some ID? And the task starts running?

A more "bulletproof" API would be something like:

  1. Create a portfolio (or portfolio group, or process group whatever primitive files get attached to), get an ID back
  2. Request to upload files to that portfolio ID
  3. Can do step 2 as many times as is useful, can be batch or not batch, doesn't matter
  4. Mark files completed after they're uploaded. We could do this behind the scenes (as noted elsewhere, using blob storage events), but we could also just roll this validation step into step 5 below.
  5. Request to start the processing

From the web client's perspective, this whole process could be kicked off in one go once they select their files, it just makes the API slightly harder to misuse, intentionally or accidentally.

CompletePortfolioUpload:
  type: object
  required:
    - incomplete_upload_ids
@bcspragu (Collaborator):

nit: Confusing name, since they are completed from the user's perspective, they're done uploading. I'd just call them upload_ids, it's unambiguous here.

@gbdubs (Contributor, Author):
I'd prefer to keep this as-is - I think drift in terminology should be avoided when it's all internal-only.

@gbdubs (Contributor, Author) commented Nov 17, 2023

> LGTM, though I'm not sure I understand the API:
>
>   1. Users request to upload N items
>   2. Users get N upload URLs
>   3. Users upload N files
>   4. Users say "Hey, I uploaded" - One weakness here is that clients don't need to report the same number N of files here
>   5. Users get back some ID? And the task starts running?
>
> A more "bulletproof" API would be something like:
>
>   1. Create a portfolio (or portfolio group, or process group, whatever primitive files get attached to), get an ID back
>   2. Request to upload files to that portfolio ID
>   3. Can do step 2 as many times as is useful, can be batch or not batch, doesn't matter
>   4. Mark files completed after they're uploaded. We could do this behind the scenes (as noted elsewhere, using blob storage events), but we could also just roll this validation step into step 5 below.
>   5. Request to start the processing
>
> From the web client's perspective, this whole process could be kicked off in one go once they select their files, it just makes the API slightly harder to misuse, intentionally or accidentally.

OK FWIW, I think this is mostly doing what you think it is, just with a few translations.

  • An Incomplete Upload is a file that has been uploaded but hasn't parsed successfully - either because parsing failed, or because it is currently processing, or because it hasn't yet been processed. This staging-ground approach lets every portfolio in the system correspond to a unit that is fit for any type of processing. The requirements specify things like deleting incomplete uploads by default, and building that kind of logic over portfolios would have been riskier. When an incomplete upload successfully parses into one or more portfolios, we delete the incomplete upload.
  • The "marking as completed" is just the request to start processing - it's just step 5; there is no step 4.
  • The reason I think it is stronger to request the set of incomplete uploads to process is that if an upload fails (e.g. the file is too big and is rejected by cloud storage), the client can be smarter about how it handles partial-success cases on the frontend, with no extra storage required on the backend (i.e. if we didn't specify which uploads to run when we say "start the run", we would have to record more concretely which blobs were uploaded and which were associated with a not-yet-started task).

Does that make sense?

@gbdubs gbdubs requested a review from bcspragu November 17, 2023 17:12
@gbdubs gbdubs merged commit b714121 into main Nov 17, 2023
2 checks passed